REPORT  DOCUMENTATION  PAGE 


Form  Approved 
OMB  NO.  0704-0188 




1. AGENCY USE ONLY (Leave blank)  2. REPORT DATE


4. TITLE AND SUBTITLE


3.  REPORT  TYPE  AND  DATES  COVERED 


Reprint


5.  FUNDING  NUMBERS 


TITLE ON REPRINT


6.  AUTHOR(S) 


AUTHOR(S)


DAAG55-98-1-0230


7.  PERFORMING  ORGANIZATION  NAMES(S)  AND  ADDRESS(ES) 


8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


University  of  Texas  -  Austin 


9.  SPONSORING  /  MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

U.S.  Army  Research  Office 
P.O.  Box  12211 

Research  Triangle  Park,  NC  27709-2211 


10.  SPONSORING /MONITORING 
AGENCY  REPORT  NUMBER 


ARO  37634.29-PH 


11.  SUPPLEMENTARY  NOTES 


The views, opinions and/or findings contained in this report are those of the author(s) and should not be construed as
an official Department of the Army position, policy or decision, unless so designated by other documentation.


12a.  DISTRIBUTION  /  AVAILABILITY  STATEMENT 


12  b.  DISTRIBUTION  CODE 


Approved  for  public  release;  distribution  unlimited. 


13.  ABSTRACT  (Maximum  200  words) 


ABSTRACT ON REPRINT




17.  SECURITY  CLASSIFICATION 
OF REPORT

UNCLASSIFIED 


18.  SECURITY  CLASSIFICATION 
OF  THIS  PAGE 

UNCLASSIFIED 


19.  SECURITY  CLASSIFICATION 
OF  ABSTRACT 

UNCLASSIFIED 


15. NUMBER OF PAGES


16.  PRICE  CODE 


20.  LIMITATION  OF  ABSTRACT 


NSN  7540-01-280-5500 


Standard  Form  298  (Rev.  2-89) 

Prescribed by ANSI Std. Z39-18
298-102 


To appear in: Soft-computing and image processing, S.K. Pal, A. Ghosh, and M.K. Kundu, Eds., Springer-Verlag

Knowledge  Reuse  Mechanisms  for 
Categorizing  Related  Image  Sets 


Kurt D. Bollacker and Joydeep Ghosh

Department  of  Electrical  and  Computer  Engineering, 
University  of  Texas  at  Austin, 

Austin,  TX  78712 

{kdb,ghosh}@lans.ece.utexas.edu


Abstract. This chapter introduces the concept of classifier knowledge reuse as a
means of exploiting domain knowledge taken from old, previously created, relevant
classifiers to assist in a new classification task. Knowledge reuse helps in
constructing classifiers that generalize better from few training examples and in
evaluating images for search in an image database. In particular, we discuss a
knowledge reuse framework in which a supra-classifier improves the performance of
the target classifier using information from existing support classifiers. Soft
computing methods can be used for all three types of classifiers involved. We
explore supra-classifier design issues and introduce several types of
supra-classifiers, comparing their relative strengths and weaknesses. Empirical
examples on real world image data sets are used to demonstrate the effectiveness
of the supra-classifier framework for classification and retrieval/search in
image databases.

Keywords:  knowledge  reuse,  image  classification,  image  database,  curse  of 
dimensionality,  soft  classifiers 


1  Introduction 

1.1  A  Priori  Knowledge  for  Image  Classification 

The development of computer vision systems that can perform as well as humans
has proven to be an extremely difficult task. One of the reasons often cited for
this is the difficulty in giving artificial vision systems enough domain
knowledge to handle the complexity of real-world image understanding tasks. One
of the most important image understanding problems that suffers from this
drawback is image classification, i.e. the task of building a system that can
distinguish one category of images from another. As an example, consider the
problem of distinguishing images of males from images of females. Considerations
of physiology, customs of clothing design, grooming habits, and other less
tangible concepts are often leveraged by humans making such a decision.
Understanding what relevant knowledge is available and how it can be included in
an image classification system is still an open problem.

Exemplar based inductive image classifiers try to generalize from a given
training set. They can utilize two types of knowledge sources: raw image data
and a priori knowledge about the image data set. The raw image data may consist
of an array of pixel intensity values (for grayscale images) or color intensity
information (for color images). Images represented in this or a similar fashion
potentially contain a very large amount of information (“A picture is worth a
thousand words”), but this information is difficult to handle without other a
priori knowledge. Generally, using pixel value information directly for image
classification is extremely difficult, if not impossible, due to the extremely
high dimensional input space.

A priori knowledge about the image data set is simply information about the
data set that is external to the data itself. This information can be used in
several capacities in the construction of an image classifier. One of the most
important uses of a priori knowledge is for feature extraction. For example, the
knowledge that a set of images is from indoor office scenes might influence to
what degree edges would be considered germane features, since they tend to occur
more commonly in man-made objects. A priori knowledge can also be used to choose
image sets or learner architectures, or even modify the learning process itself.
For example, Bayesian approaches to image classification (e.g. [21]) use a priori
knowledge in the form of prior class probabilities and prior distributions
assumed for the model parameters. Some classifier architectures use the structure
and value of model parameters to represent a priori knowledge (e.g. the
discriminant function in statistical classifiers [12], size and order of features
in decision trees [23], and the type and number of hidden units, amount and form
of regularization in feed-forward neural networks [13]). Such approaches can work
very well if the resulting inductive bias matches the problem very closely.
However, in practice it may be quite difficult to use this type of knowledge to
select and tune a proper model. Also, standard assumptions used to make the
problem tractable (independence among variables, Gaussian distributions, etc.)
often result in a loss of accuracy [16,21].

1.2  Classifier  Knowledge  Reuse 

Besides training data and a priori knowledge, previously constructed classifiers
or labelers of images are a third type of knowledge source to consider for image
classifier construction. This source is essentially a product of the first two.
Image labels may have been present when the image database was originally
created, or subsequently determined during later classification tasks.
Alternatively, labels may result from a partitioning of the input image space
induced by a different data set/categorization combination. In all three cases,
the labels contain knowledge derived from both previous image data sets and a
priori information. If this knowledge is relevant to the current classification
task, then it can be used to build a better classifier. In Figure 1, salient
features (e.g. size, color, shape) have been extracted from some unknown and
perhaps currently unavailable image data set, and an artificial classifier has
been built to discriminate images of grapefruit from images of pears. We refer
to a previously constructed classifier as a support classifier. We wish to
construct a new target classifier to discriminate images of oranges from images
of apples. Given that the same features are available, we can present images of
apples and oranges to the grapefruit/pear classifier and observe its resulting
behavior. In this case, since apples are similar to pears and oranges are
similar to grapefruit, we expect that the grapefruit/pear classifier should be
able to provide some indication as to whether we are showing it an image of an
apple or an orange.

Fig. 1. Knowledge transfer between related tasks.

1.3  Characteristics  of  Classifier  Knowledge  Reuse 

Classifier Knowledge Reuse is the idea that knowledge embedded in a previously
created set of classifiers can be used to build a new classifier that performs
better than one which simply uses its current training data and any available a
priori knowledge for the current task. This is most effective when there is
insufficient information in the current training set and a priori sources, and
thus knowledge from classifier reuse can supplement existing knowledge. For
example, if there are too few or noisy training images, then statistics over
this training set may be difficult to estimate, resulting in poor learning. If
there is too little a priori information, then the feature space for artificial
learners may become too noisy or too large (high dimensional) to be searched
effectively.

Besides assisting in new classification problems, classifier knowledge reuse has
another, more interesting application. In traditional image classification
problems, the goal is to build classifiers that generalize well to new, unseen
images the classifier may encounter. Thus, the classification task is static but
the image set of interest is dynamic. Consider the converse to this: a static
image set and a changing set of classification tasks corresponding to newer uses
of or studies on the same data. The goal in this application is to be able to
create a knowledge base for future understanding and search in a fixed set of
images.

As an example, consider the Mars Pathfinder images gathered recently by NASA. At
the end of the mission, the set of available images does not grow or change.
However, as science progresses and further analyses are performed, knowledge
about this set in the form of classifications of the images may increase over
time. The set of previous classifications can function as a “knowledge profile”
about each of the images. When a researcher wants to find all of the images that
fit a particular profile, he/she could manually classify some of the images as
positive and negative examples of the concept being searched for. Knowledge from
the previous classifications could be used to build a new classifier that can
make decisions on the image set and retrieve the images that have been
classified as positive examples. If the image set is static, then the previous
classifiers no longer need to be available, as only the class labels that they
generated are important. Thus, humans, automated classifiers with a limited
lifespan (e.g. the Pathfinder probe), and other temporary types of classifiers
can be used. If the image set is not static and the previous classifiers are
still available, then new images can also be searched. An example of this type
of knowledge reuse is presented later in this chapter.

2  Methods  of  Classifier  Knowledge  Reuse 

We first briefly survey some existing research into architecture specific
classifier knowledge reuse. Most of this work has focused on knowledge transfer
mechanisms that use multilayer perceptron (MLP) neural network classifiers and
has explored two main mechanisms: (i) knowledge re-representation and (ii)
sharing internal state information. The benefits and limitations of these
existing approaches are discussed, and then a broader framework for general
classifier reuse is introduced.

2.1  Knowledge  Re-representation 

In the context of knowledge reuse, knowledge re-representation is the concept
that knowledge about a classification task is extracted from a classifier in
some new representation that is suitable for insertion into later classifiers.
There has been much work on knowledge intensive learning focusing on symbolic
rules extracted from and used in the creation of neural classifiers (e.g.
[11,14,22,31,34]). If knowledge can be represented as rules, then these rules
may be inserted into other neural networks. Often, these rules are used to
initialize or adjust the structure or weights in multilayer perceptron neural
networks. These approaches have resulted in better performing classifiers and/or
classifiers which can be trained more quickly. However, these rule extraction
approaches cannot easily reuse knowledge from non-MLP classifiers and have not
demonstrated scalability to cases where the number of relevant rules is very
large.

Although less popular, other re-representation schemes have been investigated.
In [32], a neural network is used to recognize previously learned concepts in
order to estimate the probability of an old class being presented as input when
training a new classifier. The explanation-based neural network algorithm (EBNN)
is used to train the classifier for the current classification task by using the
target function derivative information (the re-represented knowledge) to augment
the learning process. In another, unrelated work [33], a scaling vector for a
nearest neighbor classifier is learned for one classification task and reused
for another, related task.


2.2  Internal  State  Sharing 

Instead of re-representing knowledge, some knowledge reuse research has focused
on reusing internal state information, namely weight values in MLP style neural
networks. Under the belief that related classification tasks may benefit from
common internal features, Caruana [6] has created an MLP based multiple
classifier system that is trained simultaneously to perform several related
classification tasks. In this work on two layer neural networks, the first layer
is shared by several related classification tasks. The premise is that related
tasks have similar connectionist representations in the weight space, and that
by training on more and a wider variety of samples (because there are multiple
training sets), these representations can be better learned. The second layer of
this neural network is separate and independent for each classification task.
Improved classification performance has been demonstrated in some cases. Baxter
[2] has developed a rigorous analysis of a similar type of architecture, showing
that as the number of simultaneously trained tasks increases, the number of
examples needed per task for good generalization decreases. These knowledge
sharing methods are not knowledge reuse by our previous definition, since all of
the classifiers must be created simultaneously, but they share many of its
qualities. More closely matching the knowledge reuse definition is work by Pratt
[25], in which some of the trained weights from one MLP network trained for a
single task are used to initialize weights in an MLP to be trained for a later,
related task. Improved training speed has been shown for this reuse method.


2.3  Supra-Classifier  Knowledge  Reuse 

We now describe a general framework for classifier knowledge reuse recently
introduced in [3,4]. The supra-classifier knowledge reuse framework is a simple
two layer structure which allows the reuse of knowledge from any type and
quantity of previously created classifiers. These classifiers ultimately share
the same input domain as the new classification task of interest, although they
may operate on different features extracted from the images. The supra-classifier
knowledge reuse process is to present the images to all available previously
trained classifiers and then use the resulting output vector of classification
labels as the input for a second stage supra-classifier (see footnote 1). This
supra-classifier then makes the final classification decision for the current
target classification task ($c_\tau(\cdot)$ in Figure 2). Previously trained
classifiers are termed support classifiers. Support classifiers are generally
(but not always) designed for tasks other than the current target classification
task of interest. In Figure 2, two of the three support classifiers are for
different tasks, and one has been constructed for the current classification
task of interest using only the training set.

Fig. 2. A supra-classifier based knowledge reuse framework. (Diagram labels:
"Common Domain", "Final Classification $c_\tau(x)$".)
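The first stage of the framework can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the toy `(size, redness)` feature pairs, and the two threshold-based support classifiers are hypothetical and do not come from the chapter. The point it shows is that support classifiers are treated as black boxes whose only visible product is a class label.

```python
from typing import Callable, Hashable, Sequence

# A support classifier is any callable that maps an input to a class label.
# Its internals are never inspected, so any classifier type can be used.
SupportClassifier = Callable[[object], Hashable]


def support_label_vector(x, support_classifiers: Sequence[SupportClassifier]) -> tuple:
    """Return the tuple of labels that the support classifiers assign to x."""
    return tuple(c(x) for c in support_classifiers)


def make_supra_inputs(xs, support_classifiers):
    """Build the discrete input vectors that a supra-classifier would see."""
    return [support_label_vector(x, support_classifiers) for x in xs]


# Hypothetical toy support classifiers over (size, redness) feature pairs.
def grapefruit_or_pear(x):
    size, _redness = x
    return "grapefruit" if size > 0.5 else "pear"


def ripe_or_unripe(x):
    _size, redness = x
    return "ripe" if redness > 0.3 else "unripe"


images = [(0.8, 0.1), (0.3, 0.9)]
label_vectors = make_supra_inputs(images, [grapefruit_or_pear, ripe_or_unripe])
```

Each entry of `label_vectors` is one categorical feature vector; any of the supra-classifier designs discussed in this chapter would be trained on such vectors paired with target labels.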

While Figure 2 may appear to bear a superficial resemblance to recent popular
approaches such as stacking [35], committees, ensembles [15,17,27], and mixtures
of experts [19,20,26], the supra-classifier is fundamentally different from
these “combiner” approaches. The supra-classifier is a generalization of
combining in which the support classifiers may be designed for different tasks
and are immutable, having been trained previously. The classifiers in
ensembles/combiners try to solve the same classification task (though they may
be differentiated by input regions or feature selection) and are not previously
created classifiers. Techniques like combining and stacking are simply good
methods of decomposing a classification task into simpler tasks and generally do
not reuse previous knowledge.

1 This restriction allows one to use any type of support classifier since
internal information is not needed.

There is a simple probabilistic intuition to explain why the supra-classifier
can effectively reuse knowledge from previously constructed relevant classifiers.
Suppose that each image for a new classification task is represented as a point
in a two-dimensional feature space. Let there be two target classes of images, X
and O, in this space, and let a distribution of image samples be represented in
Figure 3. Suppose there is a previously trained (support) classifier that
divides the feature space into three regions: black, dark gray, and light gray.
In Figure 3 these regions are separated by dotted lines and the X and O points
in these regions are colored appropriately. In the example here, knowing that
the support classifier label is black for a particular image gives a good
indication that the target class for that image is probably X. Thus, knowing the
support classifier label has helped guess the target class label correctly with
greater probability. A formal treatment of this result is presented in [5].

Fig. 3. Knowing the support classifier labels (indicated by the grey levels)
helps to guess the target class.
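The intuition can be made concrete with a small count-based sketch. The sample counts below are invented purely for illustration (they are not read off Figure 3), and the function name is hypothetical. The sketch compares the prior probability of class X with its probability conditional on each support classifier label.

```python
from collections import Counter, defaultdict


def conditional_target_distribution(support_labels, target_labels):
    """Estimate P(target = t | support label = s) from paired label lists."""
    counts = defaultdict(Counter)
    for s, t in zip(support_labels, target_labels):
        counts[s][t] += 1
    return {
        s: {t: n / sum(c.values()) for t, n in c.items()}
        for s, c in counts.items()
    }


# Invented sample: 14 images, each with a support label and a target class.
support = ["black"] * 5 + ["dark"] * 4 + ["light"] * 5
target = (["X", "X", "X", "X", "O"]      # black region: mostly X
          + ["X", "X", "O", "O"]         # dark gray region: mixed
          + ["O", "O", "O", "O", "X"])   # light gray region: mostly O

dist = conditional_target_distribution(support, target)
prior_x = target.count("X") / len(target)
```

With these made-up counts the prior probability of X is 0.5, while the estimate conditional on the label black is 0.8, so guessing X for images labeled black is correct more often than guessing without the support label.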


3  Supra-Classifier  Design 


The supra-classifier design process is dependent on the specifics of the
training image set and support classifiers, since it should be able to use both
of these knowledge sources effectively to maximize classification accuracy. We
now discuss the criteria of size of the training set and number of support
classifiers to guide the construction of the supra-classifier and compare
different supra-classifier approaches in the context of these criteria.

3.1 Space of Knowledge Sources

Supra-classifiers make classification decisions on a vector of categorical
(class label) values. Just like any normal classifier, they use the target
training samples (and a priori information if available) to make a decision. In
a normal classifier, typically the feature set is static, and to improve
classification performance more and/or better training samples are needed. In
contrast to this, the premise of the supra-classifier framework is that
knowledge can also be added by increasing the number of relevant support
classifiers (input features). Thus, although the design of a supra-classifier is
closely tied to that of making a normal classifier with a discrete input space,
there is the additional design goal of being able to perform better when more
features are available, especially in the scenario of a static training set
size. Consider the “knowledge space” of Figure 4. Points in this space
qualitatively represent the amount of knowledge that is available to a
supra-classifier in a target training sample set. With more “good” samples or
support classifiers, the amount of knowledge increases. Goodness depends on
certain desirable conditions such as independence and random sampling. The
hypothetical greyscale shown has contours where the knowledge relevant to the
target classification task (as reflected by the ideal achievable error rate) is
equal in quantity. The supra-classifier designer must know where in this
knowledge space he/she is working, and choose a supra-classifier that functions
well in that part of the space.

Fig. 4. Hypothetical space of available knowledge about the target
classification task. (Figure label: "Much Relevant Knowledge about the Target
Classification Task".)

We now set up the mathematical framework for supra-classifiers, describe several
potential supra-classifier architectures, and discuss where in the above
knowledge space they are the most appropriate. We also discuss some techniques
that expand the region of usefulness in the knowledge space for some of these
classifiers.

3.2  Definitions 

Let the target classification task be $\tau$, and let $\tau$ have discrete range
$S_\tau$ and $d$-dimensional input domain space $\Re^d$. Let
$\{x, y\}_\tau : x \in \Re^d, y \in S_\tau$ be the set of training examples for
task $\tau$. We assume that $\{x, y\}_\tau$ is a sample set from the true
distribution for task $\tau$ having associated random variable
$(X_\tau, Y_\tau) \in (\Re^d, S_\tau)$. Our goal is to find the most likely value
of the conditional marginal $Y_\tau | (X_\tau = x)$ and define this maximum
likelihood function to be
$t_\tau(x) = \arg\max_y P(Y_\tau = y | X_\tau = x)$. Thus,
$t_\tau(\cdot) : t_\tau(\cdot) \in S_\tau$ is the target function that we would
like to approximate using the information in $\{x, y\}_\tau$. Let $B$ be a set of
support classification tasks which have the same input domain space $\Re^d$ as
task $\tau$. Let $\{c_b(\cdot)\} : b \in B$ be the corresponding set of
classifiers where each $c_b(\cdot)$ maps $\Re^d \to S_b : b \in B$ (see footnote
2). Let $X_\tau$ be the random variable associated with the input values of
training sample set $\{x, y\}_\tau$. Let $T_\tau : T_\tau = t_\tau(X_\tau)$ be
defined as the random variable associated with the target function of $X_\tau$.
Similarly, let $C_b : C_b = c_b(X_\tau)$ be the random variables resulting from
the application of $X_\tau$ to the support classifiers.

3.3  Probability  Estimate  Based  Supra-Classifiers 

One of the most common and compelling approaches to constructing a
supra-classifier is to perform probability estimates on the discrete feature
space of support classifier labels. An ideal probability based supra-classifier
$c^*(x)$ will always choose the most likely class $y \in S_\tau$ given the class
labels $\{c_b(x)\} : b \in B$ (maximum posterior probability). More
specifically, for any given set of values $\{z_b : z_b \in S_b\} : b \in B$ we
can define the maximum probability function $m_\tau(\cdot)$ as

$$m_\tau(\{z_b : z_b \in S_b\} : b \in B) = \arg\max_y P(T_\tau = y | \{C_b = z_b\} : b \in B). \tag{1}$$

We can then define an ideal classifier based on this maximum probability
function as

$$c^*(x) = m_\tau(\{c_b(x) : c_b(x) \in S_b\} : b \in B), \tag{2}$$

where $c^*(\cdot)$ has associated random variable $C^* : C^* = c^*(X_\tau)$. If
the number of support classifiers is small and the number of target training
samples is large (the upper left corner of Figure 4), then this ideal
supra-classifier can be built by estimating probabilities directly from the
sample probabilities. However, if the number of support classifiers is quite
large, Equation 2 is not directly scalable due to the curse of dimensionality
[10]. One aspect of this “curse” is the fact that in order to maintain a
constant confidence in sample based probability estimates as the dimensionality
(number of support classifiers) goes up, one must have an exponentially
increasing number of samples.

2 Although some of the support classifiers may have been trained for task $\tau$
directly, in general $b \neq \tau$ and $S_\tau \neq S_b$, as the tasks are
different.
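A direct sample-based estimate of Equation 2 amounts to a lookup table keyed by the full support label vector. The sketch below is one minimal way to write this, with hypothetical function names and toy data; returning a default value for unseen label vectors is an assumption of the sketch, not something the chapter prescribes. The helper `num_cells` shows how quickly the table grows with the number of support classifiers.

```python
from collections import Counter, defaultdict
from math import prod


def fit_direct(label_vectors, targets):
    """Count target classes separately for every observed support label vector."""
    table = defaultdict(Counter)
    for z, y in zip(label_vectors, targets):
        table[tuple(z)][y] += 1
    return table


def predict_direct(table, z, default=None):
    """Return the most frequent target class for z, or default if z was never seen."""
    counts = table.get(tuple(z))
    if not counts:
        return default
    return counts.most_common(1)[0][0]


def num_cells(label_set_sizes):
    """Number of distinct label vectors, i.e. table cells that need estimates."""
    return prod(label_set_sizes)


# Toy training data: two support classifiers, two target classes.
train_vectors = [("a", "p"), ("a", "p"), ("a", "p"), ("b", "q")]
train_targets = ["X", "X", "O", "O"]
table = fit_direct(train_vectors, train_targets)
```

For ten support classifiers with three labels each, `num_cells([3] * 10)` is 59049, so most cells stay empty unless the training set is very large. This is the scaling problem that motivates the approximations discussed next.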

Thus, in practice, approximating approaches to Equation 2 are usually required.
Most of these approximations use assumptions on and/or a priori knowledge about
the structure of dependencies between support classifiers. Adding knowledge in
this manner can reduce the dimensionality of the probabilities to be estimated.
Examples of this include belief networks [24] and log-linear modeling [7]. While
this structuring generally requires a priori knowledge specific to a particular
classification task, some common assumptions such as independence among support
classifier labels conditional on the target classification task are often made.


The Naive Bayes Classifier. If the independence assumption truly holds, then
probability based supra-classifier types restricted to the upper left corner of
Figure 4 can move rightward (toward more support classifiers) and downward
(toward fewer samples) to some degree without compromising classification
performance. The best known (and possibly simplest) classifier that takes
advantage of this is the Naive Bayes classifier [23]. Bayes rule states that

$$P(T_\tau | \{C_b\} : b \in B) = \frac{P(\{C_b\} : b \in B | T_\tau) \, P(T_\tau)}{P(\{C_b\} : b \in B)}. \tag{3}$$

Given the conditional independence assumption, the conditional term of the
numerator in Equation 3 can be calculated as

$$P(\{C_b\} : b \in B | T_\tau) = \prod_{b \in B} P(C_b | T_\tau).$$

The term $P(T_\tau)$ can be assumed constant (equal priors), estimated from the
samples, or estimated from a priori information. The denominator of Equation 3
can be assumed to be constant (equal prior support class probabilities), but is
often calculated as

$$P(\{C_b\} : b \in B) = \sum_{y \in S_\tau} P(T_\tau = y) \prod_{b \in B} P(C_b | T_\tau = y),$$

where the conditional independence assumption is made once again. The
probabilities in the RHS of Equation 3 can be estimated from the training
samples and be substituted into Equation 1. The ideal classifier of Equation 2
can then be calculated from these estimates. This supra-classifier is equivalent
to the ideal classifier if the conditional independence assumption holds.
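The following sketch implements Equation 3 with the product form of the likelihood, using plain relative frequencies estimated from the training samples. The class name, method names, and toy data are hypothetical. No extra smoothing of the frequency estimates is applied here, so a label never seen with a class gives that class zero probability; that is a simplification of this sketch.

```python
from collections import Counter, defaultdict


class NaiveBayesSupra:
    """Naive Bayes supra-classifier over vectors of support classifier labels."""

    def fit(self, label_vectors, targets):
        self.n = len(targets)
        self.class_counts = Counter(targets)
        # cond_counts[b][y][z]: samples of target class y for which
        # support classifier b produced label z.
        self.cond_counts = defaultdict(lambda: defaultdict(Counter))
        for z, y in zip(label_vectors, targets):
            for b, z_b in enumerate(z):
                self.cond_counts[b][y][z_b] += 1
        return self

    def posterior(self, z):
        """Return P(T = y | labels z) for every target class y (Equation 3)."""
        joint = {}
        for y, n_y in self.class_counts.items():
            p = n_y / self.n                                # prior P(T = y)
            for b, z_b in enumerate(z):
                p *= self.cond_counts[b][y][z_b] / n_y      # P(C_b = z_b | T = y)
            joint[y] = p
        total = sum(joint.values())                         # denominator of Eq. 3
        if total == 0:
            return joint
        return {y: p / total for y, p in joint.items()}

    def predict(self, z):
        post = self.posterior(z)
        return max(post, key=post.get)


# Toy data: labels from two support classifiers and a two-class target.
vectors = [("g", "r"), ("g", "r"), ("g", "u"), ("p", "r"),
           ("p", "u"), ("p", "u"), ("p", "r"), ("g", "u")]
targets = ["X", "X", "X", "X", "O", "O", "O", "O"]
model = NaiveBayesSupra().fit(vectors, targets)
posterior_gr = model.posterior(("g", "r"))
```

Because only one-dimensional conditional tables are estimated, the number of parameters grows linearly with the number of support classifiers rather than exponentially.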


Bayes Classifier with Feature Selection. If there are many training samples and
few support classifiers, then Equation 2 can be estimated directly. However, if
there are many support classifiers, but most of them are irrelevant to the
target classification task, then a process of feature selection can be used to
eliminate the less useful features. This corresponds to moving the target
classification task leftward in Figure 4, making direct probability estimates
easier. Feature selection requires making a judgement on which subset of the
support classifiers of a given size is optimal for supra-classifier accuracy. In
general, this is a well studied problem, and finding the best feature selection
method often depends on the target classification task.
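As one possible illustration, support classifiers can be ranked by the empirical mutual information between their labels and the target labels, keeping the top k. This criterion is an assumption chosen for the sketch; the chapter does not name a specific selection method, and the function names and toy data are hypothetical. Ranking each support classifier individually also ignores redundancy between them.

```python
from collections import Counter
from math import log2


def mutual_information(a, b):
    """Empirical mutual information (in bits) between two label sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (n_ab / n) * log2(n_ab * n / (count_a[x] * count_b[y]))
        for (x, y), n_ab in count_ab.items()
    )


def select_support_classifiers(label_vectors, targets, k):
    """Return indices of the k support classifiers most informative about the target."""
    num = len(label_vectors[0])
    scores = [
        mutual_information([z[b] for z in label_vectors], targets)
        for b in range(num)
    ]
    ranked = sorted(range(num), key=lambda b: scores[b], reverse=True)
    return ranked[:k], scores


# Toy data: support classifier 0 determines the target; classifier 1 is unrelated.
fs_vectors = [("a", "u"), ("a", "v"), ("b", "u"), ("b", "v")]
fs_targets = ["X", "X", "O", "O"]
selected, scores = select_support_classifiers(fs_vectors, fs_targets, k=1)
```

After selection, the direct estimate of Equation 2 would be computed over the retained label positions only.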


Bayes Classifier with Smoothing. Rather than assuming independence or
hoping that most of the support classifiers are irrelevant and can be excluded,
it is also possible to use smoothing to make better probability estimates of the
target classes conditional on the support class labels. The kernel method for
probability smoothing introduced in [1] allows estimation of the joint target
and support classifier label probabilities $P(T_\tau = y, \{C_b = x_b\} : b \in B)$,
which is proportional to $P(T_\tau = y \mid \{C_b = x_b\} : b \in B)$ (the conditional
target class probabilities). Suppose there are $|B|$ support classifiers and $n$
images as training samples. The kernel smoothing function can be written as:

$$P(T_\tau = y, \{C_b = x_b\} : b \in B) = \frac{1}{n} \sum_{s=1}^{n} \prod_{b=1}^{|B|} K(s, b, \lambda), \qquad (4)$$

where $K(s, b, \lambda)$ is a kernel function and $\lambda$ is a smoothing factor. The sum
is over the set of training images $x_s : s = 1 \ldots n$ and the product is over
all $|B|$ support classifiers. The kernel for a test support classifier label vector
$x_{test}$ is defined as:

$$K(s, b, \lambda) = \begin{cases} \lambda, & c_b(x_s) = c_b(x_{test}) \\ \dfrac{1 - \lambda}{|S_b| - 1}, & c_b(x_s) \ne c_b(x_{test}) \end{cases} \qquad (5)$$


where $x_s$ is the $s$th training image and $|S_b|$ is the number of different class
labels for support classifier $b$. $\lambda$ is defined only on
$\max_b \frac{1}{|S_b|} \le \lambda \le 1$. The case of $\lambda = 1$ means there is no
smoothing and, with a large number of support classifiers (lower right corner
of Figure 4), would mean that most of the probability estimates would almost
certainly be zero. Since we are interested in the left hand side of

$$P(T_\tau = y \mid \{C_b = x_b\} : b \in B) = \frac{P(T_\tau = y, \{C_b = x_b\} : b \in B)}{P(\{C_b = x_b\} : b \in B)} \qquad (6)$$

and the denominator of the right hand side of Equation 6 is constant for a
given image, if we simply calculate the numerator of Equation 6 for all possible
target values and take the largest, we are performing a direct estimation
of the ideal probabilistic supra-classifier in Equation 2.
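
The smoothed estimate can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a product kernel that takes the value $\lambda$ on a label match and $(1 - \lambda)/(|S_b| - 1)$ on a mismatch, and it assumes that only training images whose target label equals $y$ contribute to the joint estimate for $y$. The function and argument names (`kernel_joint_estimate`, `smoothed_bayes_classify`, `label_counts`) are hypothetical, and every support classifier is assumed to have at least two possible labels.

```python
import numpy as np

def kernel_joint_estimate(train_labels, train_targets, test_labels, y, lam, label_counts):
    """Kernel-smoothed estimate of P(T = y, C = test_labels).

    train_labels: (n, |B|) support classifier labels of the training images.
    label_counts: |S_b| for each support classifier b (assumed >= 2).
    """
    train_labels = np.asarray(train_labels)
    train_targets = np.asarray(train_targets)
    sizes = np.asarray(label_counts, dtype=float)
    n = train_labels.shape[0]
    # True where a training label agrees with the test label, per classifier.
    match = train_labels == np.asarray(test_labels)
    # lam on a match, (1 - lam) / (|S_b| - 1) on a mismatch.
    kernel = np.where(match, lam, (1.0 - lam) / (sizes - 1.0))
    per_image = kernel.prod(axis=1)
    # Assumption: only training images with target label y contribute.
    return float(per_image[train_targets == y].sum() / n)

def smoothed_bayes_classify(train_labels, train_targets, test_labels, lam, label_counts):
    """Pick the target class with the largest smoothed joint estimate."""
    classes = sorted(set(train_targets))
    scores = [
        kernel_joint_estimate(train_labels, train_targets, test_labels, y, lam, label_counts)
        for y in classes
    ]
    return classes[int(np.argmax(scores))]
```

With `lam = 1.0` the kernel vanishes on any mismatch, so a test label vector that never occurs in the training set receives an estimate of zero for every target class, which is the no-smoothing case described above.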


3.4 Combiner Based Supra-Classifiers

Combiner or ensemble based classifiers are systems which use the classification
decisions of many simultaneous target classifiers and “combine” their
decisions into a final decision. Combiners have been extensively researched
(see [29] for a survey), so we only introduce a simple application to
the supra-classifier framework here. Consider the ideal classifier of Equation
2 constructed using only a single support classifier $b$ and call this a “voting”
classifier $c_b^{vote}(x)$ as defined by

$$c_b^{vote}(x) = m_\tau(\{c_b(x)\} : b \in B), \qquad (7)$$

where $\tau$ is the target task. The voting classifier makes a guess at the most
likely target class based only on the information from one support classifier;
in essence this is the support classifier’s “vote” for the correct target class.
The supra-classifier consists of tallying all $|B|$ votes and choosing the target
class with the most votes. While we avoid the problems of making high
dimensional probability estimates, this supra-classifier is sensitive to noisy
voters. If a few “good voters” make correct choices most of the time, they
may be overwhelmed by those voters which are essentially guessing randomly,
or always choosing the target class with the highest prior probability. Thus,
it may be desirable to weight voters by their accuracy to favor the better
voters. Some weighting schemes are discussed in Section 3.8.
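
The voting scheme can be sketched in Python as follows. This is an illustrative reading of the description above, not the authors' code: each voter maps a support classifier label to the target class seen most often with that label in the training set, and a label never seen in training falls back to the most common training class. Both choices, and the names `train_voters` and `vote_classify`, are assumptions. The optional `weights` argument stands in for any accuracy-based weighting.

```python
from collections import Counter, defaultdict

def train_voters(train_labels, train_targets):
    """Build one voter per support classifier from training counts."""
    n_classifiers = len(train_labels[0])
    # Fallback vote for labels that never appeared in training (assumption).
    prior = Counter(train_targets).most_common(1)[0][0]
    voters = []
    for b in range(n_classifiers):
        counts = defaultdict(Counter)
        for labels, target in zip(train_labels, train_targets):
            counts[labels[b]][target] += 1
        # Each support label votes for its most frequent target class.
        voters.append({label: c.most_common(1)[0][0] for label, c in counts.items()})
    return voters, prior

def vote_classify(voters, prior, test_labels, weights=None):
    """Tally the (optionally weighted) votes and return the winning class."""
    weights = weights or [1.0] * len(voters)
    tally = Counter()
    for voter, label, weight in zip(voters, test_labels, weights):
        tally[voter.get(label, prior)] += weight
    return tally.most_common(1)[0][0]
```

With uniform weights every voter counts equally, so a few informative voters can be outvoted by many uninformative ones; passing larger weights for the better voters is one way to express the weighting idea mentioned in the text.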


3.5 Tree Based Supra-Classifiers

In  a  traditional  decision  tree  classifier,  the  strategy  is  to  divide  an  input 
space  into  “hyper-rectangular”  target  class  regions  of  high  class  “purity”. 
Branching  decisions  are  based  on  how  much  each  input  feature  increases 
the  class  “purity”  of  examples  in  resulting  subregions.  The  supra-classifier 
framework  is  a  very  intuitive  application  of  tree  based  classifiers  in  that  it 
shares  the  goal  of  creating  target  class  regions  of  high  class  purity,  but  the 
regions  are  not  simple  “hyper-rectangles”  as  would  be  found  in  a  real  valued 
input  space.  Instead,  target  class  regions  are  defined  by  how  the  support 
classifiers  partition  the  image  feature  input  space. 

Consider that any single support classifier $c_b(\cdot)$ partitions the input space
$\mathcal{R}^d$ into regions labeled for the classes of classification task $b$. Recall
from the above discussion on the intuition of the supra-classifier architecture
that if $c_b(\cdot)$ has (even a small amount of) relevant knowledge to contribute to
the target classification task and a good relevance measure is chosen, then
it should partition the input space into subregions of greater purity (of the
target classes) than would a random partitioning.

Consider the extension to the set of $|B|$ support classifiers that define a set
of $|B|$ overlapping partitions of the input space. The overall result is a
partitioning of the input space consisting of $|B|$-way “intersection regions”. It is
easy to see that as $|B|$ increases, the average size of these intersection regions
will decrease as their number increases. If each of the $|B|$ support classifiers
has contributed some amount of unique knowledge, then the premise of tree
classifiers is that the average class purity of the intersection regions will also
increase. A hypothetical example can be seen in Figure 5, where a two
dimensional feature space of two target classes “X” and “O” (similar to Figure
3) is partitioned by four different support classifiers, the boundaries of which
are represented by the four line grey levels. Here it can be seen that the class
purity of each intersection region is higher when more support classifiers are
considered.


Fig. 5. A hypothetical 2-D feature space that has been partitioned by 4 different
support classifiers, identified by the different grayscales of the partition boundaries.
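
The effect of adding partitions can be checked on a labeled sample with a short Python sketch. Here "purity" is taken to be the fraction of samples that belong to the majority target class of their intersection region, which is an assumed definition rather than one given in the text; the function name `average_region_purity` is likewise hypothetical.

```python
from collections import Counter, defaultdict

def average_region_purity(label_vectors, targets, n_classifiers):
    """Sample-weighted purity of the regions induced by the first n_classifiers.

    A region is identified by the tuple of labels that the first
    n_classifiers support classifiers assign to a sample.
    """
    regions = defaultdict(Counter)
    for labels, target in zip(label_vectors, targets):
        regions[tuple(labels[:n_classifiers])][target] += 1
    # Count the samples that agree with the majority class of their region.
    majority_total = sum(max(counts.values()) for counts in regions.values())
    return majority_total / len(targets)
```

On the sample used to measure it, this purity cannot decrease as more support classifiers are included, because each region is only ever split further. The same splitting also leaves fewer samples per region, so the measured purity becomes less reliable as regions shrink.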

A difficulty with decision trees is that at each additional partitioning, the
number of image samples in each region intersection tends to drop. Thus,
after only a relatively few features, there may no longer be sufficient samples
to make the probability estimates upon which further branching is based. Most
practical decision tree classifiers order the branching by decreasing relevance,
conditional on the branches already taken. A commonly used relevance measure
is the mutual information of each support classifier with the target. Previous
work [5] has given empirical evidence that this relevance measure is also
effective in a supra-classifier framework.


3.6  Similarity  Based  Supra-Classifiers 

Probability based supra-classifiers tend to have the same strengths and
weaknesses. If the set of support classifiers for the target classification task is
large and cannot be reduced by feature selection or compensated for by
independence assumptions or smoothing, then it may not be appropriate to use these
techniques. Instead, it may be better to use similarity based supra-classifier
techniques. These techniques consist of defining distance measures
between images in the space of support classifier labels, and then making a
final target classification based on these distances. Ideally, images that are
similar would have a small distance between them, while dissimilar images
would be far from each other. For the purposes of classification, an optimal
distance measure $d(x, y)$ between two images, $x$ and $y$, for target classification
function $t(\cdot)$ would have the property:

$$d(x, y \mid t(x) \ne t(y)) > d(x, y \mid t(x) = t(y)) \qquad (8)$$

for all value pairs of $x$ and $y$ in the set of images. Equation 8 states that if two
images are of the same target class, the distance between them will always
be less than if they are of differing target classes. The challenge, then, is
to find a good distance measure and build an appropriate supra-classifier that
satisfies Equation 8.


Hamming Nearest Neighbor Supra-Classifier. The Hamming Nearest
Neighbor (HNN) is a simple classifier for discrete features (e.g. support
classifier labels) similar to a traditional nearest neighbor which operates in a
Euclidean space. If $I(\cdot)$ is the indicator function, then the (Hamming)
distance between two samples $x_{train}$ and $x_{test}$ can be calculated as

$$D_{|\{B\}|}(x_{train}, x_{test}) = \sum_{b=1}^{|\{B\}|} I(c_b(x_{train}) \ne c_b(x_{test})).$$

For each test sample, the Hamming Nearest Neighbor (HNN) supra-classifier
will choose the class label of the training sample with the smallest Hamming
distance from it. There is no need to estimate probabilities as in the
probability based supra-classifiers.
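
A minimal Python sketch of the HNN rule follows. The function names are illustrative, and ties between equally distant training samples are resolved in favor of the earliest one, which is an assumption the text does not address.

```python
def hamming_distance(labels_a, labels_b):
    """Number of support classifiers whose labels differ between two samples."""
    return sum(a != b for a, b in zip(labels_a, labels_b))

def hnn_classify(train_labels, train_targets, test_labels):
    """Return the target class of the training sample nearest in Hamming distance."""
    # min() keeps the first of several equally near samples (tie rule assumed).
    nearest = min(
        range(len(train_labels)),
        key=lambda s: hamming_distance(train_labels[s], test_labels),
    )
    return train_targets[nearest]
```

For example, with training label vectors `("a", "x", "p")` and `("b", "y", "q")`, the test vector `("a", "x", "q")` differs from the first in one position and from the second in two, so it takes the first sample's target class.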

Recent analysis indicates that the Hamming distance as used in
the HNN classifier approaches the optimal distance measure [4,5]. One result
of this analysis can be summarized in the following theorem:

Theorem:

If the support classifiers $\{c_b(\cdot)\} : b = 1 \ldots |\{B\}|$ are independent of each
other conditionally on the target class $t(\cdot)$, we are given three images $x_\alpha$, $x_\beta$,
and $x_\gamma$, chosen randomly and independently from some distribution,
$t(x_\alpha) = t(x_\gamma) \ne t(x_\beta)$, and the priors for the target classes are equal, then

$$\lim_{|\{B\}| \to \infty} P\left(D_{|\{B\}|}(x_\beta, x_\gamma) > D_{|\{B\}|}(x_\alpha, x_\gamma)\right) = 1.$$

Proof  of  this  theorem  is  described  in  [5].  This  theorem  states  that  in  the 
limit  as  more  relevant,  independent  support  classifiers  become  available,  the 
probability  that  the  Hamming  distance  between  a  training  and  test  sample 
of  different  target  classes  will  be  greater  than  the  distance  between  the  test 
sample  and  a  training  sample  of  the  same  class  approaches  1.  It  should  also 
be  noted  that  this  theorem  holds  even  if  there  is  only  one  training  sample 
of  each  target  class  and  even  if  all  of  the  support  classifiers  are  only  very 
weakly  relevant  to  the  target  classification  task.  Noise  from  totally  irrelevant 
classifiers  will  tend  to  cancel  itself  out  as  long  as  the  independence  assumption 
holds. 
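
The trend described by the theorem can be observed in a small Monte Carlo sketch in Python. The noise model is a toy assumption and is not taken from [5]: there are two target classes, and each binary support classifier independently outputs the image's own class with probability `p_agree` and the other class otherwise. The function name and default values are hypothetical.

```python
import numpy as np

def prob_different_class_is_farther(n_classifiers, p_agree=0.7, n_trials=2000, seed=0):
    """Estimate P(D(x_beta, x_gamma) > D(x_alpha, x_gamma)) by simulation.

    x_alpha and x_gamma belong to class 0, x_beta belongs to class 1.
    """
    rng = np.random.default_rng(seed)
    shape = (n_trials, n_classifiers)
    # Class-0 images receive label 1 only when a classifier errs.
    alpha = rng.random(shape) >= p_agree
    gamma = rng.random(shape) >= p_agree
    # The class-1 image receives label 1 when a classifier is correct.
    beta = rng.random(shape) < p_agree
    d_same = (alpha != gamma).sum(axis=1)   # same-class Hamming distance
    d_diff = (beta != gamma).sum(axis=1)    # different-class Hamming distance
    return float((d_diff > d_same).mean())
```

Under this toy model the estimate is well below 1 for a handful of support classifiers and climbs toward 1 as `n_classifiers` grows into the hundreds, even though each individual classifier is only moderately informative.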

Thus, the HNN supra-classifier is useful even in the extreme bottom right
corner of the knowledge space described in Figure 4. Furthermore, a simple
application of the Hoeffding inequality [18] is able to place an exponential
upper bound on the convergence rate of the HNN supra-classifier as a function
of the average relevance of the support classifiers.

Despite this potentially powerful result, a few caveats are in order. First,
the existence of an infinite number of independent, relevant support classifiers
is only possible if the classification problem has zero Bayes error. Also,
the HNN may not perform well if a few strong features can be selected for
the target task (effectively described in the upper left corner of Figure 4),
since it assumes a more uniform distribution of relevant knowledge among
the support classifiers. It is possible to weight the indicator functions in the
Hamming distance by the relevance of the support classifiers to compensate
for a non-uniform distribution of knowledge, thus moving its applicability
leftward in the knowledge space.
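
A relevance-weighted variant of the Hamming distance might look like the Python sketch below. The weights are simply passed in as a list; how they are obtained is left open, and the function names are hypothetical.

```python
def weighted_hamming_distance(labels_a, labels_b, relevance):
    """Sum the relevance weights of the support classifiers whose labels differ."""
    return sum(w for a, b, w in zip(labels_a, labels_b, relevance) if a != b)

def weighted_hnn_classify(train_labels, train_targets, test_labels, relevance):
    """Nearest neighbor under the relevance-weighted Hamming distance."""
    nearest = min(
        range(len(train_labels)),
        key=lambda s: weighted_hamming_distance(train_labels[s], test_labels, relevance),
    )
    return train_targets[nearest]
```

With weights `[1.0, 0.1, 0.1]`, a disagreement on the first (strong) support classifier outweighs disagreements on both of the weak ones, so a test sample can be matched to a training sample that an unweighted count would have ranked second.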


3.7 Use of Other Supra-Classifier Types


The problem of building a supra-classifier is simply the problem of building
a classifier that can effectively use (perhaps a large number of) categorical
input features. Many classifiers, even if not specifically designed for
categorical input, may be used if an appropriate representation for the support
classifier labels is made. For example, the Multilayer Perceptron (MLP) and
Radial Basis Function (RBF) classifiers expect their input to be a vector of
real values, so a simple “1-of-M” encoding of the support classifier output
labels can be used.
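
A “1-of-M” encoding can be sketched in Python as below. The argument `label_sets`, which lists the possible labels of each support classifier in a fixed order, is an assumed input format chosen for illustration.

```python
import numpy as np

def one_of_m_encode(labels, label_sets):
    """Concatenate one indicator vector per support classifier.

    labels: the label each support classifier assigned to one image.
    label_sets: for each support classifier, its possible labels in a fixed order.
    """
    pieces = []
    for label, values in zip(labels, label_sets):
        indicator = np.zeros(len(values))
        indicator[values.index(label)] = 1.0   # a single 1 marks the assigned label
        pieces.append(indicator)
    return np.concatenate(pieces)
```

The resulting vector has one real-valued component per possible label, so its length is the sum of the label set sizes over all support classifiers.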


3.8 Relevance Measures and Their Uses

In  many  cases  the  support  classifiers  will  vary  widely  in  their  usefulness  in 
assisting  a  supra-classifier.  Many  of  the  supra-classifiers  can  benefit  from 
(and  some  even  require)  knowing  the  relevance  of  the  support  classifiers  to 
build  a  practical  system.  Relevance  of  a  support  classifier  or  set  of  support 
classifiers  is  a  measure  of  its  ability  to  improve  the  classification  accuracy  of 
a  supra-classifier.  In  general,  attempts  to  make  relevance  measures  on  sets 
of  many  support  classifiers  fall  prey  to  the  same  curse  of  dimensionality  that 
the  ideal  probabilistic  supra-classifier  does  because  many,  if  not  all,  of  these 
measures  depend  on  probability  estimates.  Thus,  we  will  make  independence 
assumptions  as  needed  so  that  we  may  only  consider  relevance  measures  of 
single  support  classifiers.  Depending  on  the  specific  type  of  measure,  relevance 
can  be  used  to  weight  support  classifiers  for  purposes  such  as  ranking  of 
support  classifiers  for  feature  selection  or  weighting  of  terms  in  a  Hamming 
distance  for  the  HNN  supra-classifier. 


Mutual Information. The information theoretic measure mutual information
is a measure of “shared knowledge” between two random variables. A
standard definition of mutual information between random variables $U$ and
$V$ in bits is

$$I(U, V) = \sum_{u,v} P(U = u, V = v) \log_2 \frac{P(U = u, V = v)}{P(U = u) P(V = v)}. \qquad (9)$$

If $I(T_\tau, C_b) > I(T_\tau, C'_b)$ where $C_b = c_b(X_\tau)$, $C'_b = c'_b(X_\tau)$, and
$T_\tau = t(X_\tau)$, then we say that $c_b$ “knows” more about $t_\tau$ than does $c'_b$.
From Fano’s inequality [9], we also know that in this case, an information
theoretic upper bound on performance of a classifier built to perform
classification task $\tau$ using only the information from $c_b$ is higher than for one
built using only the information from $c'_b$. As mentioned earlier, this is a
commonly used relevance measure in decision tree construction, where the measure
is made conditionally on all of the previously taken branches.
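
Equation 9 can be estimated from label counts as in the following Python sketch, which replaces each probability with its sample frequency. The ranking helper is one possible use of the measure for feature selection; both function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(u_values, v_values):
    """Plug-in estimate of I(U, V) in bits from paired samples."""
    n = len(u_values)
    joint = Counter(zip(u_values, v_values))
    count_u = Counter(u_values)
    count_v = Counter(v_values)
    total = 0.0
    for (u, v), count in joint.items():
        p_uv = count / n
        # Pairs that never occur contribute nothing and are not visited.
        total += p_uv * math.log2(p_uv / ((count_u[u] / n) * (count_v[v] / n)))
    return total

def rank_by_relevance(label_vectors, targets):
    """Order support classifiers by decreasing mutual information with the target."""
    n_classifiers = len(label_vectors[0])
    scores = [
        mutual_information([labels[b] for labels in label_vectors], targets)
        for b in range(n_classifiers)
    ]
    order = sorted(range(n_classifiers), key=lambda b: scores[b], reverse=True)
    return order, scores
```

With few training samples these frequency-based estimates are noisy, so the resulting ranking is best read as a rough ordering rather than a precise measurement.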


To appear in: Soft-computing and image processing, S.K. Pal and A. Ghosh, and M. K. Kundu, Eds., Springer-Verlag, 1


A Value Distance Metric. Stanfill and Waltz [30] introduced a Value Distance
Metric (VDM) to measure the distance between two discretely valued
vectors for instance based learning (IBL) methods, making it applicable to
similarity based supra-classifiers such as the HNN. This metric considers the
differences between the frequencies of each target class occurring over the
target training set conditional on the value of each feature (support classifier
label), summed over all of the support classifiers. Consider the labels $c_b(x)$
and $c_b(y)$ of a single support classifier $b$ for two images $x$ and $y$. In a
supra-classifier framework, the VDM defines the distance between these two values
to be:

$$d_b(c_b(x), c_b(y)) = \sum_{t} \left| P(T_\tau = t \mid C_b = c_b(x)) - P(T_\tau = t \mid C_b = c_b(y)) \right|^k \qquad (10)$$

where the sum runs over the target classes $t$ and the $P(\cdot)$ are the sample
probabilities of each event over the image training set. Often, the constant
$k = 1$ is used. Thus, for every support classifier with $|S_b|$ possible labels,
there is a $|S_b| \times |S_b|$ matrix of distance values $d_b(\cdot, \cdot)$. Stanfill and
Waltz also included a weighting term $w$ which made $d_b(\cdot, \cdot)$ asymmetric. A
priori knowledge is required to use this weight effectively, and its exclusion
keeps $d_b(\cdot, \cdot)$ symmetric.

Using the above metric calculated for all of the support classifiers over
the image training set allows the total distance between two images to be:

$$D(c_b(x), c_b(y)) = \sum_{b=1}^{|B|} w_x w_y \, d_b(c_b(x), c_b(y))^r \qquad (11)$$

where $w_x$ and $w_y$ are weights on the images themselves and $r$ determines a
norm (e.g. $r = 2$ means Euclidean distance). In some IBL methods, these
weights can be used to favor those images that help discriminate the target
classes better. In [8] a modified VDM (MVDM) demonstrates empirically the
usefulness of the VDM with the HNN classifier in an IBL context.
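
The two VDM equations can be sketched in Python as follows, using the symmetric form without the Stanfill and Waltz weighting term. The table layout, the function names, and the default image weights of 1 are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def vdm_table(support_labels, targets, k=1):
    """Distance between every pair of labels of one support classifier.

    Uses sample frequencies of each target class conditional on the label.
    """
    classes = sorted(set(targets))
    counts = defaultdict(Counter)
    for label, target in zip(support_labels, targets):
        counts[label][target] += 1
    # Conditional target class frequencies for each observed label.
    cond = {
        label: [c[t] / sum(c.values()) for t in classes]
        for label, c in counts.items()
    }
    return {
        (v1, v2): sum(abs(p - q) ** k for p, q in zip(cond[v1], cond[v2]))
        for v1 in cond
        for v2 in cond
    }

def vdm_distance(tables, labels_x, labels_y, r=1, w_x=1.0, w_y=1.0):
    """Total distance between two images, summed over the support classifiers."""
    return sum(
        w_x * w_y * table[(a, b)] ** r
        for table, a, b in zip(tables, labels_x, labels_y)
    )
```

Only labels observed in the training set appear in a table, so a lookup for an unseen label raises a `KeyError`; a practical system would need some rule for that case, which the text does not specify.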

4  Experiments 

In order to demonstrate knowledge reuse in the supra-classifier framework, we
have chosen two classification tasks, one each for the supra-classifier
framework’s two major application areas. The first application is the enhancement
of classification performance of a new classifier related to previously
constructed classifiers. For this, a collection of binary classifiers of images of
military vehicles is used to aid in the creation of a similar such classifier.
Second, previous classification labelings of images in a database by a human
user are used to predict current classifications of interest to that person on
the same database. These predictions could then be used to recall specific
images.


4.1  Target  Recognition 

The goal here is to build a classifier to discriminate between two classes
of military vehicles which are labeled HMMWV and 2S1. The sources of
knowledge available are a training set of second generation FLIR images of
outdoor scenes containing these two types of vehicles and a collection of ten
previously built vehicle discriminators. The images were segmented to extract
only the immediate region around the vehicles and each such sample is then
represented by 47 scalar features, including 23 Zernike moments, 7 standard
moments, 6 normalized/central moments, and other assorted features such
as average intensity, height, width, etc. [28]. The training set for the new
classifier consists of 20 examples of HMMWV and 75 examples of 2S1.

All of the support classifiers are multilayer perceptron (MLP) two-class
neural networks that have been constructed to discriminate between the
following pairs of vehicle types: “M35 and HMMWV”, “M35 and M60”, “M35
and ZSU”, “M35 and M730”, “HMMWV and M60”, “HMMWV and ZSU”,
“HMMWV and M730”, “M60 and ZSU”, “M60 and M730”, and “ZSU and
M730”. Figure 6 shows sample images from the five classes. The M35 is
a truck, the M60, 2S1, and M730 are tanks, the HMMWV is a “hummer”
transport, and the ZSU is a Soviet anti-aircraft launcher. Note that, while
some of these discriminators include HMMWV as a class, none include the
2S1.


Fig. 6. Examples of preprocessed second generation FLIR images used for the
target recognition problem: From top, left to right: 2S1, HMMWV, M35, M60 and
M730.

Careful choosing and parameter hand tuning of a simple MLP classifier
allows a classification rate of about 98.5% using all of the training examples
as a knowledge source. As mentioned earlier, classifier knowledge reuse
is most useful when there is a dearth of knowledge from the training examples.
Thus, we created an experimental setup that purposefully held back
some of the training examples to see if the knowledge from the previously
constructed classifiers could compensate for the loss of training set information.
The number of available training examples ranged from 4 to 32, evenly
distributed among the two target classes. We trained three traditional, unaided
target classifiers for comparison: an MLP, a traditional single nearest
neighbor classifier, and a C4.5 decision tree.

Eleven support classifiers were available: the ten previously constructed
classifiers and an unaided MLP target classifier, chosen because it was a good
performer in informal testing. Several supra-classifier types were constructed
including C4.5, MLP, the combiner based (VOTE), naive Bayes (BAYES),
and Hamming nearest neighbor (HNN) classifiers. The target class examples
were randomly divided into training and test examples. The supra-classifiers
and unaided classifiers were constructed using the training examples and
tested on the rest. This was iterated for 500 trials for each quantity of training
examples considered. Figure 7 shows the classification rate of the various
supra and unaided classifiers on the test set versus the number of target
training examples available. For very few training examples, all but the C4.5
supra-classifiers provided a substantial performance improvement over all of
the unaided classifiers, with the combiner (VOTE) followed by the naive Bayes
supra-classifiers demonstrating the highest overall performance. The unaided
MLP classifier was a superior performer to the unaided nearest neighbor and
C4.5 classifiers. As more training examples became available, unsurprisingly
the benefit of knowledge reuse diminished, since there was more knowledge
available from the training set. The results shown are statistically significant,
but for neatness, the error bars are not shown. These results give evidence
that when an inadequate target training set results in poor classification
performance, knowledge reuse can help.


Fig. 7. Classification rate of several supra-classifiers and unaided classifiers versus
the number of target training examples, for the target recognition problem.


4.2  Building  An  Image  Knowledge  Base 

Although the supra-classifier knowledge reuse framework can help in the
construction of new, related classifiers, perhaps its best application is to the
construction of a knowledge base for a (possibly fixed) set of images. Consider
a fixed set of images such as works of art for which multiple classifications
have been made by human experts, artificial classifiers, and other types of
systems. For a particular image, the set of classification labels can act as a
powerful description that can be used in understanding it. Suppose one knew
the answer to dozens or perhaps even hundreds of categorizations of a
photograph (e.g. indoor or outdoor, natural or man-made, whether it contains
people, etc.). An “internal representation” as to what the photograph was
about could be formed and perhaps one could even correctly answer novel
questions about it, all without ever actually having seen the image. This novel
desired classification could then be used to recall images from the database.

The  supra-classifier  knowledge  reuse  framework  provides  some  of  the  tools 
to  construct  a  system  with  such  an  ability.  Given  an  image  database  where 
each  image  is  annotated  with  a  large  number  of  classifications,  a  user  could 
manually  classify  a  few  positive  and  negative  examples  of  some  novel  concept 
of  interest,  which  would  become  a  training  set  for  a  supra-classifier.  The 
major  challenge  of  building  a  supra-classifier  for  a  database  of  this  sort  is 
designing  one  that  can  effectively  use  a  large  number  of  support  classifiers 
with  only  a  small  number  of  training  samples.  This  case  of  a  small  number  of 
high  dimensional  training  samples  corresponds  to  the  lower  right  hand  corner 
of  the  knowledge  space  described  in  Figure  4. 

In order to demonstrate how a supra-classifier framework can be used
as part of an image knowledge base system, a data set of 30 color images
(primarily photographs) from the authors’ personal collection and from a
commercial CD-ROM was assembled. These images were chosen to be
(subjectively) as diverse as possible, and some of them can be seen in Figure 8.
We defined 71 sets of potential categorizations for these images that
represented mutually exclusive concepts. Examples include “Big vs. Small”, “Clean
vs. Dirty”, “Busy vs. Calm”, and “Solid vs. Liquid vs. Gas”. A web site
(http://www.lans.ece.utexas.edu/cgibin/cgiwrap/kdb/top.pl) was created to
present the 30 images to six human users, who were asked to classify the
images in each of the 71 ways. These classifications constitute a “personal
profile” of knowledge about the images for each user, essentially making the
users become their own “support classifiers”. A 72nd classification was also
made by each user for each image to act as a test “target” for novel
classifications. For this target, the users were asked to decide whether they “liked”
each image more than the “average” image in their judgement or not. A
supra-classifier would use these target classifications as a training set and
function as the judge of which images to recall based on whether the user
would “like” each image. Although, for demonstration purposes, the six users
were asked to make a target classification for all of the 30 images, in a real
system only a few such classifications should be needed if there
is already enough information from a large number of support classifiers.


Fig. 8. Nine of the 30 images in the data set.


First Experiment - Number of Training Samples. The 30 photographs
were randomly split into training and test sets of varying sizes so that the
dependency of supra-classifier performance on the number of training examples
could be explored. The training set size ranged from 5 to 25 images with
the rest held as test images. We then used several types of supra-classifiers
to predict whether a subject would “like” each test image. These included a
multilayer perceptron neural network (MLP), a C4.5 decision tree classifier
(C4.5), a Naive Bayes classifier (BAYES), the combiner (VOTE), a Hamming
nearest neighbor (HNN) and a simple baseline classifier that always guessed
the most common class in the training set (MCC). 500 trials (random splits
of the image sets) for each training set size and for each of the six users were
performed. The average classification test rate over all 500 trials for each type
of supra-classifier versus the number of available training examples is shown
in Figure 9. Here we can see that the HNN classifier performs better than
other classifiers when there are very few training samples, with this margin of
superiority waning as the number of training samples grows. As more training
images became available, performance of all of the supra-classifiers improved,
although to varying degrees.


Fig. 9. Test rate versus number of training samples for each of the classifiers.

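
The repeated random-split protocol and the MCC baseline can be sketched in Python as below. This is a generic illustration, not the code used for the reported results. The `fit_predict` callable interface, the function names, and the fixed seed are all assumptions.

```python
import random
from collections import Counter

def mcc_fit_predict(train_samples, train_targets, test_samples):
    """Baseline: always predict the most common class in the training set."""
    most_common = Counter(train_targets).most_common(1)[0][0]
    return [most_common] * len(test_samples)

def average_test_rate(samples, targets, n_train, n_trials, fit_predict, seed=0):
    """Mean test accuracy over repeated random training/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    total = 0.0
    for _ in range(n_trials):
        rng.shuffle(indices)                       # new random split each trial
        train, test = indices[:n_train], indices[n_train:]
        predictions = fit_predict(
            [samples[i] for i in train],
            [targets[i] for i in train],
            [samples[i] for i in test],
        )
        correct = sum(p == targets[i] for p, i in zip(predictions, test))
        total += correct / len(test)
    return total / n_trials
```

Any classifier that follows the same `fit_predict` signature can be passed in place of `mcc_fit_predict`, so the same loop yields comparable averages for each classifier type and training set size.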

Second Experiment - Number of Support Classifiers. To study scenarios
where only a few training examples but a very large number of support
classifiers are available, we used a setup similar to that of the first experiment but
with the number of training samples held at five. The number of support
classifiers was varied to measure supra-classifier scalability with increasing
input dimensionality. 200 trials of random training/test set splits for each of
several quantities of support classifiers ranging from 4 to 71 for one of the
users were performed. The average classification test rate over the trials for
the “like/don’t like” labeling versus the number of available support classifiers
is shown in Figure 10. Beyond 36 support classifiers, only the HNN
supra-classifier continued to show improvement and most of the other
supra-classifiers were not scalable beyond a small number of support classifiers.
The naive Bayes classifier actually got worse, probably due to failed
independence assumptions. All but the naive Bayes and MLP supra-classifiers,
however, had statistically significantly better performance than the baseline
MCC classifier.


Fig. 10. Test rate versus number of input features for each of the classifiers.


With a very large number of support classifiers, one would hope that by
only classifying a small number of training images, other images of the desired
class could be retrieved. This experiment is germane to such practical
applications of supra-classifier knowledge reuse because even though most of
the image classifications in this experiment were quite subjective in nature,
a supra-classifier was able to use knowledge implicit in these subjective
classifications to classify on a novel concept much better than random guessing.

5  Summary  and  Recommendations 

The problem of insufficient domain knowledge poses a challenge in many image
classification problems. Classifier knowledge reuse is discussed as a possible
additional source of domain knowledge beyond traditional training set
and a priori knowledge sources. It provides a more automated process for the
inclusion of large amounts of high level domain knowledge that are implicit
in existing classifiers. The supra-classifier framework is proposed as an
approach to practical classifier knowledge reuse. Several issues of supra-classifier
design and potential supra-classifier architectures are discussed, including the
Hamming nearest neighbor classifier which demonstrates scalability to large
amounts of classifier domain knowledge. Experiments showing two types of
applications of supra-classifier knowledge reuse are presented. The first shows
how to enhance a novel classifier’s performance by reusing knowledge and the
second examines how the supra-classifier framework can be used to estimate
human subjective classifications of images in an image database of fixed
composition. The experiments indicate that there is no single ideal supra-classifier
architecture, although the Hamming nearest neighbor did demonstrate excellent
performance in the traditionally difficult to handle case of high dimensionality
and low training sample size. This motivates further investigation
of the HNN architecture.

Currently, when novel image classifiers are built, most previous classifiers
that may be relevant to the new task are ignored or are simply unavailable.
The construction of a “database of image classifiers” would be a means for
those who have built and those who need to build image classifiers to
implicitly collaborate by using existing image classifiers from the database and
contributing newly created ones.